Importing Standard Libraries

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import warnings
warnings.filterwarnings("ignore")

Reading the Data

In [2]:
df = pd.read_csv('vehicle-1.csv')
data = df.copy()
df.head()
Out[2]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
0 95 48.0 83.0 178.0 72.0 10 162.0 42.0 20.0 159 176.0 379.0 184.0 70.0 6.0 16.0 187.0 197 van
1 91 41.0 84.0 141.0 57.0 9 149.0 45.0 19.0 143 170.0 330.0 158.0 72.0 9.0 14.0 189.0 199 van
2 104 50.0 106.0 209.0 66.0 10 207.0 32.0 23.0 158 223.0 635.0 220.0 73.0 14.0 9.0 188.0 196 car
3 93 41.0 82.0 159.0 63.0 9 144.0 46.0 19.0 143 160.0 309.0 127.0 63.0 6.0 10.0 199.0 207 van
4 85 44.0 70.0 205.0 103.0 52 149.0 45.0 19.0 144 241.0 325.0 188.0 127.0 9.0 11.0 180.0 183 bus

Name of the Variables and Shape of the Data

In [3]:
df.columns
Out[3]:
Index(['compactness', 'circularity', 'distance_circularity', 'radius_ratio',
       'pr.axis_aspect_ratio', 'max.length_aspect_ratio', 'scatter_ratio',
       'elongatedness', 'pr.axis_rectangularity', 'max.length_rectangularity',
       'scaled_variance', 'scaled_variance.1', 'scaled_radius_of_gyration',
       'scaled_radius_of_gyration.1', 'skewness_about', 'skewness_about.1',
       'skewness_about.2', 'hollows_ratio', 'class'],
      dtype='object')
In [4]:
df.shape
Out[4]:
(846, 19)

Exploratory Data Analysis

In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 846 entries, 0 to 845
Data columns (total 19 columns):
compactness                    846 non-null int64
circularity                    841 non-null float64
distance_circularity           842 non-null float64
radius_ratio                   840 non-null float64
pr.axis_aspect_ratio           844 non-null float64
max.length_aspect_ratio        846 non-null int64
scatter_ratio                  845 non-null float64
elongatedness                  845 non-null float64
pr.axis_rectangularity         843 non-null float64
max.length_rectangularity      846 non-null int64
scaled_variance                843 non-null float64
scaled_variance.1              844 non-null float64
scaled_radius_of_gyration      844 non-null float64
scaled_radius_of_gyration.1    842 non-null float64
skewness_about                 840 non-null float64
skewness_about.1               845 non-null float64
skewness_about.2               845 non-null float64
hollows_ratio                  846 non-null int64
class                          846 non-null object
dtypes: float64(14), int64(4), object(1)
memory usage: 125.7+ KB
In [6]:
df.isnull().sum()
Out[6]:
compactness                    0
circularity                    5
distance_circularity           4
radius_ratio                   6
pr.axis_aspect_ratio           2
max.length_aspect_ratio        0
scatter_ratio                  1
elongatedness                  1
pr.axis_rectangularity         3
max.length_rectangularity      0
scaled_variance                3
scaled_variance.1              2
scaled_radius_of_gyration      2
scaled_radius_of_gyration.1    4
skewness_about                 6
skewness_about.1               1
skewness_about.2               1
hollows_ratio                  0
class                          0
dtype: int64
Observation
  1. The Dataset consists of 19 variables with 846 rows
  2. Except for class column which is a Categorical column every other column is a Numerical column
  3. The Dataset has missing values for multiple columns. The columns are circularity, distance_circularity, radius_ratio, pr.axis_aspect_ratio, scatter_ratio, elongatedness, pr.axis_rectangularity, scaled_variance, scaled_variance.1, scaled_radius_of_gyration, scaled_radius_of_gyration.1, skewness_about, skewness_about.1 and skewness_about.2
  4. Class column is the Target variable & should not be considered for PCA

Handling Categorical Column

In this case it is the Target column i.e. Class Column

In [7]:
TARGET = "class"
df[TARGET].value_counts()
Out[7]:
car    429
bus    218
van    199
Name: class, dtype: int64
In [8]:
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
df[TARGET] = labelencoder.fit_transform(df[TARGET])
df[TARGET].value_counts()
Out[8]:
1    429
0    218
2    199
Name: class, dtype: int64
Observation: After label encoding we identify 0 = bus, 1 = car & 2 = van

Handling Missing Values

Since the number of missing values in not too large. We will replace the missing values with the MEDIAN of the column

In [9]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='median', verbose=1)
#fill missing values with mean column values
transformed_values = imputer.fit_transform(df)
column = df.columns
df = pd.DataFrame(transformed_values, columns = column )
In [10]:
print("Data after treating missing value : ")
df.isnull().sum()
Data after treating missing value : 
Out[10]:
compactness                    0
circularity                    0
distance_circularity           0
radius_ratio                   0
pr.axis_aspect_ratio           0
max.length_aspect_ratio        0
scatter_ratio                  0
elongatedness                  0
pr.axis_rectangularity         0
max.length_rectangularity      0
scaled_variance                0
scaled_variance.1              0
scaled_radius_of_gyration      0
scaled_radius_of_gyration.1    0
skewness_about                 0
skewness_about.1               0
skewness_about.2               0
hollows_ratio                  0
class                          0
dtype: int64
In [11]:
df.describe().transpose()
Out[11]:
count mean std min 25% 50% 75% max
compactness 846.0 93.678487 8.234474 73.0 87.00 93.0 100.00 119.0
circularity 846.0 44.823877 6.134272 33.0 40.00 44.0 49.00 59.0
distance_circularity 846.0 82.100473 15.741569 40.0 70.00 80.0 98.00 112.0
radius_ratio 846.0 168.874704 33.401356 104.0 141.00 167.0 195.00 333.0
pr.axis_aspect_ratio 846.0 61.677305 7.882188 47.0 57.00 61.0 65.00 138.0
max.length_aspect_ratio 846.0 8.567376 4.601217 2.0 7.00 8.0 10.00 55.0
scatter_ratio 846.0 168.887707 33.197710 112.0 147.00 157.0 198.00 265.0
elongatedness 846.0 40.936170 7.811882 26.0 33.00 43.0 46.00 61.0
pr.axis_rectangularity 846.0 20.580378 2.588558 17.0 19.00 20.0 23.00 29.0
max.length_rectangularity 846.0 147.998818 14.515652 118.0 137.00 146.0 159.00 188.0
scaled_variance 846.0 188.596927 31.360427 130.0 167.00 179.0 217.00 320.0
scaled_variance.1 846.0 439.314421 176.496341 184.0 318.25 363.5 586.75 1018.0
scaled_radius_of_gyration 846.0 174.706856 32.546277 109.0 149.00 173.5 198.00 268.0
scaled_radius_of_gyration.1 846.0 72.443262 7.468734 59.0 67.00 71.5 75.00 135.0
skewness_about 846.0 6.361702 4.903244 0.0 2.00 6.0 9.00 22.0
skewness_about.1 846.0 12.600473 8.930962 0.0 5.00 11.0 19.00 41.0
skewness_about.2 846.0 188.918440 6.152247 176.0 184.00 188.0 193.00 206.0
hollows_ratio 846.0 195.632388 7.438797 181.0 190.25 197.0 201.00 211.0
class 846.0 0.977541 0.702130 0.0 0.00 1.0 1.00 2.0
Observations
  • Since for majority of the columns the mean and median (50%) are almost similar, it signifies that they are normally distributed and their are no outliers present in these variables.
  • Few columns like scatter_ratio, scaled_variance & scaled_variance.1 indicates presence of outliers (to be treated later) hence there will be some skewness present in these variables. We will evaluate it later in detail.
In [12]:
sns.pairplot(df, diag_kind='kde')
Out[12]:
<seaborn.axisgrid.PairGrid at 0x24c9cd36b88>
Inferences
  • We can see that there are many features which are highly correlated (positively as well as negatively) among each other. The columns compactness, circularity, distance_circularity, radius_ratio, scatter_ratio, elongatedness, pr.axis_rectangularity, max.length_rectangularity, scaled_variance, scaled_variance.1, scaled_radius_of_gyration are highly correlated.
  • Columns pr.axis_aspect_ratio and max.length_aspect_ratio do not have correlation with majority of other columns.
  • Columns scaled_radius_of_gyration.1 has high negative correlation with skewness_about.2 and hollows_ratio columns. It has very less corelation with other columns.
  • Column skewness_about.2 is positively correlated with the hollows_ratio column.
  • We also notice presence of outliers in some dimensions as the points seems far from othe cluster of points.
  • To summarize, there are lot of dimensions which seems correlated to each other visually and we will need to perform Correlation analysis to make more inferences about the data.

Numerical Variables

In [13]:
#Method to show Distribution & Box plot for the variable along with skewness
def showPlots(df, col):
    fig,(ax1,ax2) = plt.subplots(nrows=1,ncols=2)
    fig.set_size_inches(10,2)
    sns.distplot(df[col],ax=ax1)
    ax1.set_title("Distribution Plot")
    sns.boxplot(df[col],ax=ax2)
    ax2.set_title("Box Plot")
    print(df[col].skew())
In [14]:
showPlots(df, 'compactness')
0.38127063263996497
Observation: compactness column seems normally distributed.
In [15]:
showPlots(df, 'circularity')
0.2649279874901165
Observation: circularity column seems normally distributed with multiple guassian's.
In [16]:
showPlots(df, 'distance_circularity')
0.10871801180935975
Observation: distance_circularity column seems normally distributed with multiple guassian's.
In [17]:
showPlots(df, 'radius_ratio')
0.3975716412698015
Observation: radius_ratio column is right skewed with outliers.
In [18]:
showPlots(df, 'pr.axis_aspect_ratio')
3.8353916077858434
Observation: pr.axis_aspect_ratio column is right skewed with outliers.
In [19]:
showPlots(df, 'max.length_aspect_ratio')
6.7783936191089476
Observation: max.length_aspect_ratio column is right skewed with outliers.
In [20]:
showPlots(df, 'scatter_ratio')
0.6087097328672928
Observation: scatter_ratio column seems normally distributed with multiple guassian's.
In [21]:
showPlots(df, 'elongatedness')
0.046951051315584164
Observations: elongatedness column seems normally distributed with multiple guassian's.
In [22]:
showPlots(df, 'pr.axis_rectangularity')
0.7744056757899445
Observations: pr.axis_rectangularity column seems normally distributed with multiple guassian's.
In [23]:
showPlots(df, 'max.length_rectangularity')
0.2563591641353724
Observations: max.length_rectangularity column seems normally distributed.
In [24]:
showPlots(df, 'scaled_variance')
0.6555976294220067
Observations: scaled_variance column seems normally distributed with multiple guassian's.There is a outlier present.
In [25]:
showPlots(df, 'scaled_variance.1')
0.8453454281630146
Observations: scaled_variance.1 column is lightly right skewed with outliers.
In [26]:
showPlots(df, 'scaled_radius_of_gyration')
0.27990964799345835
Observations: scaled_radius_of_gyration column seems normally distributed.
In [27]:
showPlots(df, 'scaled_radius_of_gyration.1')
2.0899787533912066
Observations: scaled_radius_of_gyration.1 column is highly right skewed with outliers.
In [28]:
showPlots(df, 'skewness_about')
0.7808132397211246
Observations: skewness_about column is moderately right skewed with outliers.
In [29]:
showPlots(df, 'skewness_about.1')
0.6890143067342678
Observations: skewness_about.1 column is lighty right skewed with an outlier.
In [30]:
showPlots(df, 'skewness_about.2')
0.24998506992542593
Observations: skewness_about.2 column seems normally distributed.
In [31]:
showPlots(df, 'hollows_ratio')
-0.22634128032982512
Observations: hollows_ratio column seems normally distributed with multiple guassian's.

Let see the summary of the data to further understand the numerical variables.

Observations:

From the above graphs, it can be seen that the columns radius_ratio, pr.axis_aspect_ratio, max.length_aspect_ratio, scaled_variance, scaled_variance.1, scaled_radius_of_gyration.1, skewness_about, skewness_about.1 have outliers. Therefore, we will treat outliers of these columns before proceeding further.

Outliers Treatment

In [32]:
#We will handle the outliers using IQR for all the columns
from scipy.stats import iqr

def handleOutliers(odf):
    Q1 = odf.quantile(0.25)
    Q3 = odf.quantile(0.75)
    IQR = Q3 - Q1
    cleandf = odf[~((odf < (Q1 - 1.5 * IQR)) | (odf > (Q3 + 1.5 * IQR))).any(axis=1)]
    print(cleandf.shape)
    return cleandf
In [33]:
df = handleOutliers(df)
(813, 19)

Target Column Distribution

In [34]:
df.hist(column=TARGET)
Out[34]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x0000024CB0768348>]],
      dtype=object)
Observation
  • The columns "class" has categorical data having 3 different values of 0 for bus, 1 for car & 2 for van.
  • The graph shows the distribution of the data across the three classes. Class type car i.e 1 takes up 50% of the data within the Vehicle Data Set. While van and bus have near about similar counts within the data set.

Correlation Comparison

In [35]:
corr = df.corr()
# select only the lower triangle of the correlation matrix
lower_triangle = np.tril(corr) 
# to mask the upper triangle in the following heatmap
mask = lower_triangle == 0  

plt.figure(figsize = (25,14))
sns.set(font_scale=1.8)
# Setting it to white so that we do not see the grid lines
sns.set_style(style = 'white')  
sns.heatmap(lower_triangle, center=0.5, cmap= 'coolwarm', annot= True, xticklabels = corr.index, yticklabels = corr.columns,
            cbar= False, mask = mask, linecolor='white', vmax=.8, fmt='.2f',linewidths=0.01)

#Logic to manage the known issue with Matplotlib version 3.1.11 w.r.t. Heat Map
b, t = plt.ylim() # discover the values for bottom and top
b += 0.5 # Add 0.5 to the bottom
t -= 0.5 # Subtract 0.5 from the top
plt.ylim(b, t) # update the ylim(bottom, top) values
plt.show()
sns.set(font_scale=1)

Observations

Strong or Very High Correlations ( > 0.9 )

Few important observations are

  • circularity is correlated with max.length_rectangularity & scaled_radius_of_gyration
  • distance_circularity is positively correlated with scatter_ratio & negtively with elongatedness
  • scatter_ratio is correlated with scaled_variance.1, scaled_variance, pr.axis_rectangularity & negtively with elongatedness
  • elongatedness is negatively correlated to scaled_variance.1, scaled_variance & pr.axis_rectangularity
  • pr.axis_rectangularity is correlated with scaled_variance.1 & scaled_variance
  • scaled_variance is correlated with scaled_variance.1
  • scaled_radius_of_gyration.1 is negatively correlated with hollows_ratio & skewness_about.2
High Correlations ( > 0.7 till 0.9 )

Few important observations are

  • The columns compactness, circularity, distance_circularity, radius_ratio, scatter_ratio, elongatedness, pr.axis_rectangularity, max.length_rectangularity, scaled_variance, scaled_variance.1 & scaled_radius_of_gyration are highly correlated among each other.
  • skewness_about.2 is highly correlated with hollows_ratio
Low or No Correlations ( < 0.5 )

Few important observations are

  • pr.axis_aspect_ratio, max_length_aspect_ratio & scaled_radius_of_gyration seems to have very little correlation with all the other columns
  • scaled_radius_of_gyration & scaled_radius_of_gyration.1 seems to have almost no correlation
  • scaled_radius_of_gyration.1 & skewness_about seems to have almost no correlation
  • skewness_about, skewness_about.1 & skewness_about.2 are not correlated to each other.

Important Insight

  • The Pair plot as well as the Correlation Comparison helps us to infer that there are considerable number of columns that are strongly correlated to each other. So, either they need to dropped of treated carefully before we go for model building.
  • Our objective is to recognize whether the class is a van or bus or car based on some input features hence we need to ensure that there is little or no multicollinearity between the features. If our dataset has perfectly positive or negative attributes as can be obseverd from our correlation analysis, there is a high chance that the performance of the model will be impacted by Multicollinearity eventually leading to skewed or misleading results.

Resolve Multicollinearity

We are only considering a couple of most popular approaches used for resolving multicollinearity from the dataset

Approach 1
  • If two features are highly correlated then there is no point using both features & we can drop one feature. Based on above observations we can decide to get rid of those columns whose correlation is +-0.9 or above. There are 8 such columns:
    • max.length_rectangularity
    • scaled_radius_of_gyration
    • skewness_about.2
    • scatter_ratio
    • elongatedness
    • pr.axis_rectangularity
    • scaled_variance
    • scaled_variance.1
Approach 2
  • As we observed in our analysis that there are more than 50% attributes that are highly correlated with each other. Dropping or eliminating highly correlated columns could result in loss of information and hence we could prefer using the popular dimension reduction algorithm such as Principal Component Analysis (PCA)
We will be implementing both the approaches and finally evaluate them together.

Storing the processed dataset

In [36]:
processed_df = df

Model Building

In [37]:
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from prettytable import PrettyTable
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

Common Methods

Model Execution Functions

In [38]:
SPLIT_VALUE = 0.30
SEED = 1501

# Function to split Target Variable from data
def SplitData(d):
    #Set of Independent Variables
    X=d.drop(TARGET, axis=1)
    #Dependent Variable
    y=d[TARGET]
    return X,y
    
# Function to Scale data
def ScaleData(X):
    scaler = preprocessing.StandardScaler()
    return scaler.fit_transform(X)

# Function to split data into training & test set
def SplitTrainTestData(X, y):    
    Xtrain, Xtest, ytrain, ytest = train_test_split (X, y, test_size=SPLIT_VALUE, stratify=y, random_state=SEED)
    print("Training Data Shape: {0}".format(Xtrain.shape))
    print("Testing Data Shape: {0}".format(Xtest.shape))
    return Xtrain, Xtest, ytrain, ytest

# Function to fit the model
def ModelFit (model, Xtr, ytr):
    fit = model.fit(Xtr, ytr)
    print(fit)
    return model

# Function to predict from model
def ModelPredict (model, Xtt, ytt):
    pred = model.predict(Xtt)
    acc_scr = accuracy_score(ytt, pred)
    return pred, acc_scr

# Function to print the Results of the model
def PrintResults(model, pred, Xtr, Xtt, ytr, ytt):  
    x = PrettyTable()
    x.field_names = ["Metrics", "Results"]
    x.add_row(["Classification Report", classification_report(ytt, pred)])    
    x.add_row(["Accuracy Score", accuracy_score(ytt, pred)])
    x.add_row(["",""])
    x.add_row(["Confusion Matrix", confusion_matrix(ytt, pred)])
    x.add_row(["",""])
    x.add_row(["Training Data Score", model.score(Xtr, ytr)])
    x.add_row(["",""])
    x.add_row(["Testing Data Score", model.score(Xtt, ytt)])
    print(x)

Hyper Parameter Tuning Function

In [39]:
def SVCTuneHyperParams(svc, Xtr, ytr):
    svc = SVC()
    Cs = [0.1, 1, 10, 100]
    gammas = [0.01, 0.1, 1, 10]
    kernel = ['linear', 'rbf', 'poly']
    param = dict(kernel = kernel, C = Cs, gamma = gammas)
    gs = GridSearchCV(svc, param, cv=3, scoring='accuracy', n_jobs = -1)
    gs.fit(Xtr, ytr)

    svc_bestScore = gs.best_score_
    svc_bestParam = gs.best_params_

    #Creating new model with best Parameters and running on the data again
    k = svc_bestParam['kernel']
    C = svc_bestParam['C']
    g = svc_bestParam['gamma']
    svc = SVC(kernel = k, C=C, gamma =g, probability=True)
    svc.fit(Xtr, ytr)

    x = PrettyTable()
    x.field_names = ["Hyper Tuning", "Results"]
    x.add_row(["Best Accuracy", svc_bestScore])    
    x.add_row(["",""])
    x.add_row(["Best Parameter", svc_bestParam])
    print(x)
    return svc

Cross Validation Function

In [40]:
from sklearn import model_selection
SPLITS = 10

def KFoldCrossValidation (name, model, X, y, scoring):
    kfold = model_selection.KFold(n_splits=SPLITS)
    results = model_selection.cross_val_score(model, X, y, cv=kfold, scoring=scoring)
    print(results)
    mean, std = results.mean(), results.std()
    x = PrettyTable()
    x.field_names = ["Cross Validation", "Score Mean", "Score Standard Deviation"]
    x.add_row([name, mean, std])
    print(x)
    return mean, std

Model Evaluation Function

In [41]:
# A single Function to execute all the steps for Model Evaluation
def ModelEvaluation(X, y, name, model, hyperTuneFunc, scoring):   
    # Creating Training & Test Set
    Xtrain, Xtest, ytrain, ytest = SplitTrainTestData(X, y)
    
    # Standardize X Data
    Xtrain = ScaleData(Xtrain)
    Xtest = ScaleData(Xtest)
    
    # Model Training/Fitting
    model = ModelFit(model, Xtrain, ytrain)
    
    # Get Model Prediction & Accuracy Score
    pred, scr = ModelPredict(model, Xtest, ytest)
    
    # Results of model
    PrintResults(model, pred, Xtrain, Xtest, ytrain, ytest)
    
    # Perform Cross Validation
    mean, std = KFoldCrossValidation(name, model, X, y, scoring)
    
    # The code will tune the Hyper Parameter to retrieve the optimal value for the Model.
    hy_model = hyperTuneFunc(model, Xtrain, ytrain)
    
    # Get Model Prediction & Accuracy Score
    hy_pred, hy_scr = ModelPredict(hy_model, Xtest, ytest)
    
    # Results of model
    PrintResults(hy_model, hy_pred, Xtrain, Xtest, ytrain, ytest)
    
    return scr, mean, hy_scr

Import Libraries

In [42]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

Approach 1 - Drop Features

In [43]:
# Dropping the highly correlated columns from dataset
df = processed_df
droppedCols = ["max.length_rectangularity", "scaled_radius_of_gyration", "skewness_about.2", 
               "scatter_ratio", "elongatedness", "pr.axis_rectangularity", "scaled_variance", "scaled_variance.1"]
df1 = df.drop(droppedCols, axis=1)

# Splitting Data - Extract Target Column
X, y = SplitData(df1)

name = "Approach 1 - Drop Features"
model = SVC(kernel='linear', probability=True)
scoring = 'accuracy'
scr_a1, scr_cv_a1, scr_a1_hy = ModelEvaluation(X, y, name, model, SVCTuneHyperParams, scoring)
Training Data Shape: (569, 10)
Testing Data Shape: (244, 10)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='linear', max_iter=-1, probability=True, random_state=None,
    shrinking=True, tol=0.001, verbose=False)
+-----------------------+-------------------------------------------------------+
|        Metrics        |                        Results                        |
+-----------------------+-------------------------------------------------------+
| Classification Report |               precision    recall  f1-score   support |
|                       |                                                       |
|                       |          0.0       0.77      0.81      0.79        62 |
|                       |          1.0       0.93      0.87      0.90       125 |
|                       |          2.0       0.85      0.93      0.89        57 |
|                       |                                                       |
|                       |     accuracy                           0.87       244 |
|                       |    macro avg       0.85      0.87      0.86       244 |
|                       | weighted avg       0.87      0.87      0.87       244 |
|                       |                                                       |
|     Accuracy Score    |                   0.8688524590163934                  |
|                       |                                                       |
|    Confusion Matrix   |                     [[ 50   6   6]                    |
|                       |                      [ 13 109   3]                    |
|                       |                     [  2   2  53]]                    |
|                       |                                                       |
|  Training Data Score  |                   0.8945518453427065                  |
|                       |                                                       |
|   Testing Data Score  |                   0.8688524590163934                  |
+-----------------------+-------------------------------------------------------+
[0.86585366 0.87804878 0.91463415 0.88888889 0.88888889 0.87654321
 0.90123457 0.85185185 0.90123457 0.87654321]
+----------------------------+--------------------+--------------------------+
|      Cross Validation      |     Score Mean     | Score Standard Deviation |
+----------------------------+--------------------+--------------------------+
| Approach 1 - Drop Features | 0.8843721770551038 |   0.017573691941427935   |
+----------------------------+--------------------+--------------------------+
+----------------+-----------------------------------------+
|  Hyper Tuning  |                 Results                 |
+----------------+-----------------------------------------+
| Best Accuracy  |            0.9384885764499121           |
|                |                                         |
| Best Parameter | {'C': 1, 'gamma': 0.1, 'kernel': 'rbf'} |
+----------------+-----------------------------------------+
+-----------------------+-------------------------------------------------------+
|        Metrics        |                        Results                        |
+-----------------------+-------------------------------------------------------+
| Classification Report |               precision    recall  f1-score   support |
|                       |                                                       |
|                       |          0.0       0.89      1.00      0.94        62 |
|                       |          1.0       0.98      0.94      0.96       125 |
|                       |          2.0       0.98      0.91      0.95        57 |
|                       |                                                       |
|                       |     accuracy                           0.95       244 |
|                       |    macro avg       0.95      0.95      0.95       244 |
|                       | weighted avg       0.95      0.95      0.95       244 |
|                       |                                                       |
|     Accuracy Score    |                   0.9508196721311475                  |
|                       |                                                       |
|    Confusion Matrix   |                     [[ 62   0   0]                    |
|                       |                      [  6 118   1]                    |
|                       |                     [  2   3  52]]                    |
|                       |                                                       |
|  Training Data Score  |                   0.9648506151142355                  |
|                       |                                                       |
|   Testing Data Score  |                   0.9508196721311475                  |
+-----------------------+-------------------------------------------------------+

Approach 2 - PCA

Common Methods

In [44]:
from sklearn.decomposition import PCA

# Function to perform PCA analysis based on supplied parameters
def PCAFit(X, component):
    # Standardize X Data
    X = ScaleData(X)
    
    # PCA technique implementation
    pca = PCA(n_components = component)
    pca.fit(X)
    
    # Print Results
    PrintPCAResults(pca)
    return pca, X

# Function to print the results of PCA Fit
def PrintPCAResults(m):
    x = PrettyTable()
    x.field_names = ["PCA Analysis", "Result"]
    x.add_row(["Eigen Values", m.explained_variance_])
    x.add_row(["",""])
    x.add_row(["Eigen Vectors", m.components_])
    x.add_row(["",""])
    x.add_row(["Variation Ratio", m.explained_variance_ratio_])
    print(x)
    
def PCATransform(X, component):
    #Perform PCA Fit
    pca, X = PCAFit(X, component)
    return pca.transform(X)

Approach 2 - Original Data

In [45]:
df = processed_df

# Splitting Original Data - Extract Target Column
X, y = SplitData(df)

name = "Approach 2 - Original Data"
model = SVC(kernel='linear', probability=True)
scoring = 'accuracy'
scr_a2, scr_cv_a2, scr_a2_hy = ModelEvaluation(X, y, name, model, SVCTuneHyperParams, scoring)
Training Data Shape: (569, 18)
Testing Data Shape: (244, 18)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='linear', max_iter=-1, probability=True, random_state=None,
    shrinking=True, tol=0.001, verbose=False)
+-----------------------+-------------------------------------------------------+
|        Metrics        |                        Results                        |
+-----------------------+-------------------------------------------------------+
| Classification Report |               precision    recall  f1-score   support |
|                       |                                                       |
|                       |          0.0       0.89      0.92      0.90        62 |
|                       |          1.0       0.97      0.94      0.95       125 |
|                       |          2.0       0.93      0.96      0.95        57 |
|                       |                                                       |
|                       |     accuracy                           0.94       244 |
|                       |    macro avg       0.93      0.94      0.93       244 |
|                       | weighted avg       0.94      0.94      0.94       244 |
|                       |                                                       |
|     Accuracy Score    |                   0.9385245901639344                  |
|                       |                                                       |
|    Confusion Matrix   |                     [[ 57   2   3]                    |
|                       |                      [  7 117   1]                    |
|                       |                     [  0   2  55]]                    |
|                       |                                                       |
|  Training Data Score  |                   0.9630931458699473                  |
|                       |                                                       |
|   Testing Data Score  |                   0.9385245901639344                  |
+-----------------------+-------------------------------------------------------+
[0.92682927 0.95121951 0.98780488 0.96296296 0.95061728 0.9382716
 0.95061728 0.95061728 0.92592593 0.97530864]
+----------------------------+--------------------+--------------------------+
|      Cross Validation      |     Score Mean     | Score Standard Deviation |
+----------------------------+--------------------+--------------------------+
| Approach 2 - Original Data | 0.9520174646190906 |   0.018584139383693316   |
+----------------------------+--------------------+--------------------------+
+----------------+--------------------------------------------+
|  Hyper Tuning  |                  Results                   |
+----------------+--------------------------------------------+
| Best Accuracy  |             0.9859402460456942             |
|                |                                            |
| Best Parameter | {'C': 100, 'gamma': 0.01, 'kernel': 'rbf'} |
+----------------+--------------------------------------------+
+-----------------------+-------------------------------------------------------+
|        Metrics        |                        Results                        |
+-----------------------+-------------------------------------------------------+
| Classification Report |               precision    recall  f1-score   support |
|                       |                                                       |
|                       |          0.0       0.94      0.95      0.94        62 |
|                       |          1.0       0.99      0.96      0.98       125 |
|                       |          2.0       0.92      0.96      0.94        57 |
|                       |                                                       |
|                       |     accuracy                           0.96       244 |
|                       |    macro avg       0.95      0.96      0.95       244 |
|                       | weighted avg       0.96      0.96      0.96       244 |
|                       |                                                       |
|     Accuracy Score    |                   0.9590163934426229                  |
|                       |                                                       |
|    Confusion Matrix   |                     [[ 59   0   3]                    |
|                       |                      [  3 120   2]                    |
|                       |                     [  1   1  55]]                    |
|                       |                                                       |
|  Training Data Score  |                   0.9982425307557118                  |
|                       |                                                       |
|   Testing Data Score  |                   0.9590163934426229                  |
+-----------------------+-------------------------------------------------------+

PCA

As a start point we will perform PCA Fit considering the components are same as the number of columns in the dataset

In [46]:
df = processed_df

# Splitting Original Data - Extract Target Column
X, y = SplitData(df)

pca, X = PCAFit(df, len(df.columns) - 1)
+-----------------+---------------------------------------------------------------------+
|   PCA Analysis  |                                Result                               |
+-----------------+---------------------------------------------------------------------+
|   Eigen Values  |  [9.8298015  3.44128316 1.65599217 1.19615816 0.97616542 0.65974346 |
|                 |   0.3864265  0.29761637 0.19704424 0.10406877 0.0759178  0.06026822 |
|                 |  0.04187279 0.02920509 0.02379233 0.01812963 0.01667134 0.01011893] |
|                 |                                                                     |
|  Eigen Vectors  |  [[ 2.70691468e-01  2.84800745e-01  2.99997141e-01  2.72356336e-01  |
|                 |     9.93248898e-02  1.90714137e-01  3.10714628e-01 -3.08950949e-01  |
|                 |     3.07552412e-01  2.74880720e-01  3.03338914e-01  3.07270112e-01  |
|                 |     2.61849314e-01 -4.07378116e-02  3.54416416e-02  5.84807516e-02  |
|                 |             3.36207648e-02  8.00593344e-02 -6.47218129e-02]         |
|                 |   [-9.76802238e-02  1.22194970e-01 -5.37351869e-02 -1.94248137e-01  |
|                 |    -2.37782809e-01 -1.26065372e-01  7.18504422e-02 -1.65938579e-02  |
|                 |     8.14445608e-02  1.07689954e-01  7.36461380e-02  7.76785254e-02  |
|                 |     2.03229985e-01  5.05327397e-01 -2.69841879e-02 -9.17492848e-02  |
|                 |            -4.88004760e-01 -5.06289988e-01 -1.64516382e-01]         |
|                 |   [-7.11834443e-02 -1.19953092e-01 -8.55483973e-02  2.06555223e-01  |
|                 |     3.62061993e-01 -4.43810618e-01  5.25236470e-02 -1.14782911e-01  |
|                 |     2.52954482e-02 -2.21050553e-01  1.17419966e-01  6.09627633e-02  |
|                 |    -5.66947745e-02  6.81505269e-02 -2.84625922e-01  4.30911964e-02  |
|                 |             1.47802609e-01 -3.30410076e-02 -6.37101241e-01]         |
|                 |   [ 2.70599786e-02 -1.90768097e-01  1.15222715e-01 -7.34035348e-02  |
|                 |    -3.55548492e-01  4.52332778e-02  9.42473117e-02 -4.24668364e-02  |
|                 |     1.04436172e-01 -1.55314880e-01  9.49053983e-02  9.20815052e-02  |
|                 |    -2.22203180e-01  2.62786184e-03 -3.10123642e-01  7.72360988e-01  |
|                 |            -8.91129844e-02  6.61157041e-03  4.96421203e-02]         |
|                 |   [ 1.80161630e-01 -8.86948703e-02 -2.89370576e-02 -4.86269293e-02  |
|                 |    -2.42167549e-01 -3.20339891e-01  4.04140518e-02 -1.38860747e-02  |
|                 |     4.72776577e-02 -1.34329614e-01  5.79972221e-02  6.07374905e-02  |
|                 |    -9.14854141e-03  1.11182895e-02  8.21655241e-01  1.82812486e-01  |
|                 |             1.65379677e-01 -1.07302128e-02 -1.72746115e-01]         |
|                 |   [ 2.44849313e-01 -5.54386600e-02 -1.44874240e-02 -1.58978346e-01  |
|                 |    -6.13813169e-01 -2.11854801e-01  8.12443550e-02 -6.65606543e-02  |
|                 |     8.77628273e-02 -4.19490857e-02  1.18291487e-01  1.14986332e-01  |
|                 |    -6.99163769e-02 -1.51454779e-01 -3.19623181e-01 -4.82619707e-01  |
|                 |             2.67849869e-01  7.67545735e-02 -3.99016570e-02]         |
|                 |   [ 3.80926907e-01 -2.77053345e-01  3.26184528e-02  2.26373017e-01  |
|                 |     2.71541313e-01 -1.93094653e-01  7.63882853e-02  8.40792023e-03  |
|                 |     1.06471066e-01 -2.65289093e-01  1.34736816e-01  1.20111124e-01  |
|                 |    -2.65807094e-01  2.29135728e-01 -6.61101051e-03 -1.60160012e-01  |
|                 |            -1.00555935e-01 -2.25888650e-01  5.33495593e-01]         |
|                 |   [-1.74673290e-01 -3.00786291e-01  2.22072857e-01  5.71758831e-02  |
|                 |    -6.84953300e-02  5.33185669e-01  8.79159611e-02 -1.96523532e-01  |
|                 |     5.25280216e-02 -3.00931270e-01  4.63589706e-02  2.72570336e-02  |
|                 |    -3.38105502e-01 -8.87116612e-02  1.80538101e-01 -2.87179485e-01  |
|                 |            -2.32281257e-01 -6.62676753e-02 -3.15109614e-01]         |
|                 |   [ 6.35378382e-01  3.32163339e-02 -3.23326404e-01 -1.22824504e-01  |
|                 |     7.73535385e-02  3.43543608e-01 -5.34891145e-02  7.90268639e-02  |
|                 |    -3.98529864e-02  2.09378013e-01 -1.73132528e-01 -7.28037470e-02  |
|                 |    -2.65655461e-01  2.88588892e-01 -3.94544894e-02  6.11547394e-02  |
|                 |             7.40213407e-02  3.81680153e-02 -3.09291926e-01]         |
|                 |   [ 4.51754193e-01  6.11312005e-02  1.52091082e-01  1.32973275e-01  |
|                 |    -3.21507985e-02 -2.87164554e-02 -1.39197284e-01  1.90286514e-01  |
|                 |    -1.33394314e-01 -2.62103384e-01 -9.97430078e-02 -1.71003178e-01  |
|                 |     3.27766033e-01 -4.97793243e-01 -6.73691307e-02  4.10880924e-02  |
|                 |            -3.57124378e-01 -2.36598460e-01 -1.52542571e-01]         |
|                 |   [-9.48317419e-02  2.49024602e-01 -1.31524099e-01  9.63153652e-02  |
|                 |     2.17789037e-02 -2.26854177e-01  1.10791589e-01  1.13969438e-01  |
|                 |     2.05615173e-01  3.65602392e-01 -1.58313733e-01  1.49387798e-01  |
|                 |    -5.69469107e-01 -4.24652820e-01  5.31298739e-02 -8.62019772e-03  |
|                 |            -2.42327002e-01 -1.86789806e-01 -2.75655887e-02]         |
|                 |   [ 8.39844913e-02  1.25975237e-01  7.45224689e-01 -1.23125271e-01  |
|                 |     3.49982655e-02 -1.68043225e-01 -1.70826307e-01 -6.86046140e-02  |
|                 |    -2.85770222e-01  2.19356572e-01 -4.72913586e-04 -2.47793477e-01  |
|                 |    -2.86634774e-01  1.50841072e-01 -1.23864574e-02 -4.65935360e-03  |
|                 |             1.52304902e-01 -1.42095721e-01 -5.95496507e-02]         |
|                 |   [ 8.05236021e-02 -1.26209764e-01  2.96734396e-01 -3.55302902e-01  |
|                 |     1.60574764e-01 -1.87258619e-01  7.38523442e-02  2.19136823e-01  |
|                 |     2.83855597e-01 -3.99533917e-02 -3.09015216e-01  1.99619604e-01  |
|                 |     5.73159372e-02  9.55581750e-02 -4.93824021e-03 -8.15542469e-02  |
|                 |            -3.40167702e-01  5.40169825e-01 -6.98018506e-02]         |
|                 |   [-1.90878821e-02 -3.38521046e-01  7.79208634e-02 -4.75007415e-01  |
|                 |     2.60562235e-01  1.37907266e-01  1.96454974e-02  2.48450169e-01  |
|                 |     2.20298693e-01  1.56253928e-01  1.02047455e-01  1.85994018e-01  |
|                 |     1.47872310e-01 -2.37160944e-01 -1.27260764e-02  3.23140532e-02  |
|                 |             3.29490694e-01 -4.44510274e-01 -2.97569485e-02]         |
|                 |   [ 1.10283175e-01 -2.46775114e-01 -1.87761778e-01 -3.73647683e-01  |
|                 |     1.62372614e-01 -1.50332432e-01  3.97929117e-02 -6.28880934e-01  |
|                 |    -2.01029283e-01  2.15524516e-01  1.81656859e-01 -1.90379510e-01  |
|                 |     2.18812500e-02 -2.28818164e-01  1.91760023e-02  1.89315203e-02  |
|                 |            -2.81964956e-01  1.02377385e-01  7.98623040e-02]         |
|                 |   [-1.34756611e-02 -5.94017112e-01  6.63731146e-02  4.40466445e-01  |
|                 |    -1.77691426e-01 -6.44521881e-02 -2.63637542e-02 -4.28297158e-02  |
|                 |     1.20638154e-01  4.64000332e-01 -3.46686682e-01 -1.48272606e-01  |
|                 |     1.79832533e-01  5.16821425e-02 -1.23782984e-02 -4.47217650e-03  |
|                 |             1.62963308e-03 -3.78921570e-02 -4.04903162e-02]         |
|                 |   [ 1.00568890e-02  2.02488764e-01  2.14127833e-02 -1.13910441e-01  |
|                 |     3.70730457e-02 -1.29305336e-03  1.21941599e-01 -4.72043301e-01  |
|                 |     2.41208327e-01 -2.74074555e-01 -6.71533929e-01 -5.17804043e-02  |
|                 |     5.60169057e-02  8.94377416e-04 -3.31806122e-02  3.71446856e-02  |
|                 |             2.17792326e-01 -2.41655786e-01  8.94684963e-02]         |
|                 |   [ 1.02673415e-02 -8.34709418e-02  2.08169034e-02  2.58150841e-02  |
|                 |    -7.94973537e-03  1.11502747e-02  2.73968406e-01 -6.05489241e-03  |
|                 |    -6.84010212e-01  4.05986804e-02 -2.58669238e-01  6.12478783e-01  |
|                 |     4.42796950e-02 -1.65454427e-02 -5.28153867e-03  1.16681151e-02  |
|                 |            9.29526805e-03 -4.98978861e-02  3.40664969e-03]]         |
|                 |                                                                     |
| Variation Ratio |  [0.51672162 0.18089739 0.08705028 0.06287826 0.05131393 0.03468063 |
|                 |   0.02031322 0.01564475 0.01035799 0.00547057 0.00399076 0.00316811 |
|                 |  0.00220112 0.00153522 0.00125069 0.00095302 0.00087636 0.00053192] |
+-----------------+---------------------------------------------------------------------+

Principal Component Estimation

In [47]:
# Plotting the variance expalained by the principal components and the cumulative variance explained.
expVar = pca.explained_variance_ratio_
length = len(pca.explained_variance_ratio_) + 1
cExpVar = np.cumsum(pca.explained_variance_ratio_)

plt.figure(figsize=(10 , 5))
plt.bar(range(1, length), expVar, alpha = 0.5, align = 'center', label = 'Individual explained variance')
plt.step(range(1, length), cExpVar, where='mid', label = 'Cumulative explained variance')
plt.ylabel('Explained Variance Ratio')
plt.xlabel('Principal Components')
plt.legend(loc = 'best')
plt.tight_layout()
plt.show()

Observation: The graph clearly shows that 7 Principal Components seems reasonable to capture more than 95% of the variance in data

Dimensionality Reduction

In [48]:
df = processed_df

# Splitting Original Data - Extract Target Column
X, y = SplitData(df)

pcadf = PCATransform(X, 7)
+-----------------+-----------------------------------------------------------------------------+
|   PCA Analysis  |                                    Result                                   |
+-----------------+-----------------------------------------------------------------------------+
|   Eigen Values  |      [9.7929757  3.37710644 1.20873054 1.1365956  0.89628686 0.65829313     |
|                 |                                  0.32305653]                                |
|                 |                                                                             |
|  Eigen Vectors  |  [[ 0.27225105  0.28537005  0.30148623  0.27259451  0.09857976  0.19475579  |
|                 |     0.31051844 -0.30843834  0.30754849  0.27630107  0.30274811  0.30704063  |
|                 |     0.26152049 -0.04363236  0.0367057   0.05885041  0.03483739  0.08281362] |
|                 |   [-0.08972848  0.13317394 -0.04402596 -0.20423223 -0.25913686 -0.09457563  |
|                 |     0.07233508 -0.01168768  0.08409153  0.12583663  0.07019986  0.07793366  |
|                 |     0.20992728  0.50391445 -0.01456825 -0.09339805 -0.50166421 -0.50654656] |
|                 |   [-0.02260451 -0.21080994  0.07087808  0.04021396 -0.11480523 -0.13931348  |
|                 |     0.1129247  -0.09003305  0.11106355 -0.21987769  0.14481876  0.11532395  |
|                 |    -0.21362744  0.06739209 -0.52162344  0.68717064 -0.06220695 -0.04080354] |
|                 |   [-0.13041903  0.02067855 -0.10742522  0.25295734  0.605228   -0.32253141  |
|                 |     0.01005404 -0.07991176 -0.01604649 -0.06665079  0.06980451  0.01736316  |
|                 |     0.07224572  0.13586056 -0.49012168 -0.38023248  0.03553916 -0.10300842] |
|                 |   [ 0.15232414 -0.13902259 -0.08073354  0.11901255  0.08321282 -0.62137607  |
|                 |     0.08124056 -0.07473792  0.0775021  -0.24614056  0.14958407  0.11511731  |
|                 |    -0.00754872  0.14052777  0.5898001   0.12779373  0.18158269 -0.11125624] |
|                 |   [ 0.25837458 -0.06889799 -0.02048009 -0.13944968 -0.58714549 -0.26562469  |
|                 |     0.08933352 -0.07258539  0.09605543 -0.06350149  0.1344589   0.12696867  |
|                 |    -0.07339618 -0.13192887 -0.31241509 -0.4825069   0.27522234  0.06057715] |
|                 |   [ 0.18879422 -0.39087124  0.17638455  0.15647445  0.10249295  0.39885179  |
|                 |     0.09142373 -0.10487575  0.09067234 -0.34966769  0.07547531  0.06996415  |
|                 |   -0.45585196  0.0790311   0.1301874  -0.31062929 -0.25955786 -0.17634877]] |
|                 |                                                                             |
| Variation Ratio |      [0.54338501 0.18738625 0.0670691  0.06306653 0.04973247 0.03652686     |
|                 |                                  0.01792551]                                |
+-----------------+-----------------------------------------------------------------------------+

Pair Plot

Showing the pair plot of the PCA Tranform Data with only 7 Principal Components

In [49]:
pca = pd.DataFrame(pcadf)
sns.pairplot(pca, diag_kind='kde')
Out[49]:
<seaborn.axisgrid.PairGrid at 0x24cb0bc78c8>
Observation: The Pair plot clearly indicates that there is no correlation between the columns as expected after PCA

Approach 2 - PCA Data

In [50]:
name = "Approach 2 - PCA Data"
model = SVC(kernel='linear', probability=True)
scoring = 'accuracy'
scr_a2_pca, scr_cv_a2_pca, scr_a2_hy_pca = ModelEvaluation(pca, y, name, model, SVCTuneHyperParams, scoring)
Training Data Shape: (569, 7)
Testing Data Shape: (244, 7)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
    kernel='linear', max_iter=-1, probability=True, random_state=None,
    shrinking=True, tol=0.001, verbose=False)
+-----------------------+-------------------------------------------------------+
|        Metrics        |                        Results                        |
+-----------------------+-------------------------------------------------------+
| Classification Report |               precision    recall  f1-score   support |
|                       |                                                       |
|                       |          0.0       0.75      0.89      0.81        62 |
|                       |          1.0       0.94      0.80      0.87       125 |
|                       |          2.0       0.83      0.95      0.89        57 |
|                       |                                                       |
|                       |     accuracy                           0.86       244 |
|                       |    macro avg       0.84      0.88      0.86       244 |
|                       | weighted avg       0.87      0.86      0.86       244 |
|                       |                                                       |
|     Accuracy Score    |                   0.8565573770491803                  |
|                       |                                                       |
|    Confusion Matrix   |                     [[ 55   5   2]                    |
|                       |                      [ 16 100   9]                    |
|                       |                     [  2   1  54]]                    |
|                       |                                                       |
|  Training Data Score  |                   0.859402460456942                   |
|                       |                                                       |
|   Testing Data Score  |                   0.8565573770491803                  |
+-----------------------+-------------------------------------------------------+
[0.84146341 0.91463415 0.85365854 0.86419753 0.79012346 0.87654321
 0.79012346 0.85185185 0.87654321 0.83950617]
+-----------------------+--------------------+--------------------------+
|    Cross Validation   |     Score Mean     | Score Standard Deviation |
+-----------------------+--------------------+--------------------------+
| Approach 2 - PCA Data | 0.8498644986449865 |   0.03627430485591257    |
+-----------------------+--------------------+--------------------------+
+----------------+------------------------------------------+
|  Hyper Tuning  |                 Results                  |
+----------------+------------------------------------------+
| Best Accuracy  |            0.9226713532513181            |
|                |                                          |
| Best Parameter | {'C': 10, 'gamma': 0.1, 'kernel': 'rbf'} |
+----------------+------------------------------------------+
+-----------------------+-------------------------------------------------------+
|        Metrics        |                        Results                        |
+-----------------------+-------------------------------------------------------+
| Classification Report |               precision    recall  f1-score   support |
|                       |                                                       |
|                       |          0.0       0.88      0.98      0.93        62 |
|                       |          1.0       0.98      0.91      0.95       125 |
|                       |          2.0       0.92      0.95      0.93        57 |
|                       |                                                       |
|                       |     accuracy                           0.94       244 |
|                       |    macro avg       0.93      0.95      0.94       244 |
|                       | weighted avg       0.94      0.94      0.94       244 |
|                       |                                                       |
|     Accuracy Score    |                   0.9385245901639344                  |
|                       |                                                       |
|    Confusion Matrix   |                     [[ 61   1   0]                    |
|                       |                      [  6 114   5]                    |
|                       |                     [  2   1  54]]                    |
|                       |                                                       |
|  Training Data Score  |                   0.9718804920913884                  |
|                       |                                                       |
|   Testing Data Score  |                   0.9385245901639344                  |
+-----------------------+-------------------------------------------------------+

Evaluate Performance of all Models

In [51]:
from prettytable import PrettyTable
x = PrettyTable()
x.field_names = ["Model Name", "Accuracy Score", "Cross Validation Score", "Accuracy Score - Hyper Tuning"]
x.add_row(["A1: Approach 1 - Drop Features", scr_a1, scr_cv_a1, scr_a1_hy])
x.add_row(["A2: Approach 2 - Original Data", scr_a2, scr_cv_a2, scr_a2_hy])
x.add_row(["A2-PCA: Approach 2 - PCA Data", scr_a2_pca, scr_cv_a2_pca, scr_a2_hy_pca])
print(x)
+--------------------------------+--------------------+------------------------+-------------------------------+
|           Model Name           |   Accuracy Score   | Cross Validation Score | Accuracy Score - Hyper Tuning |
+--------------------------------+--------------------+------------------------+-------------------------------+
| A1: Approach 1 - Drop Features | 0.8688524590163934 |   0.8843721770551038   |       0.9508196721311475      |
| A2: Approach 2 - Original Data | 0.9385245901639344 |   0.9520174646190906   |       0.9590163934426229      |
| A2-PCA: Approach 2 - PCA Data  | 0.8565573770491803 |   0.8498644986449865   |       0.9385245901639344      |
+--------------------------------+--------------------+------------------------+-------------------------------+
A1: Approach 1 - Drop Features :- In this case we have dropped the highly correlated columns from the dataset before Model Building
A2: Approach 2 - Original Data :- In this case we are using the original dataset for Model Building
A2-PCA: Approach 2 - PCA Data :- In this case we performed Dimensionality Reduction by PCA and used the data for Model Building

Conclusion

  • According to the comparison table shown above A2 has performed the best. It has the highest accuracy score at almost 94%, cross validation score at 95% & accuracy score after hyper tuning the model at almost 96%.
  • A1 has also performed fairly well considering we have dropped 8 correlated columns from the dataset. The accuracy score of almost 87% & cross validation score 88% are encouraging. The striking point is the accuracy score after hyper tuning which is very close to A2 ie. 95%. This indicates that if a well tuned model is selected then A1 can perform equally well despite loss of information.
  • In A2-PCA model we were able to carry out dimensionality reduction successfully wherein we reduced the dataset from 18 to just 7 columns. The model has performed well with expected drop in accuracy but it also gives us many benefits like improved performance & efficiency, decrease requirement of memory, capacity & computation time and most important one is non collinearity of data which reduces overfitting & improves visualization.
  • A2-PCA model has good accuracy score at almost 86% & Cross validation score at 85%. This model also jumped in accuracy after hyper tuning similar to A1. The accuracy improved to almost 94%. Here too appropriate model selection could make a difference.
  • Overall A2-PCA model has delivered good results along with the noteworthy advantages that it bring to the table.
In [ ]: